
Add GC content parameter --mingc --maxgc #30

Merged 4 commits on Apr 26, 2024

Conversation

JMencius
Contributor

Hi @wdecoster ,
Recently, I added parameters for GC content filter --mingc and --maxgc for my own project usage. I also added a testGC.fastq for testing under /chopper/test-data.
It is very useful for dealing with ONT sequencing of high-GC bacteria.
Hope you like my modifications.

@wdecoster
Owner

Hi,

Thank you for contributing :-)

I am concerned that this makes things slower for users who don't need the GC filter. Maybe it should not be calculated if people are not filtering on it?
Maybe you could benchmark whether your changes make things slower.

And this is the key part of the failed test:

[screenshot: failed test output]

Best,
Wouter

@JMencius
Contributor Author

Yeah, the performance is an issue. I will make it an optional parameter then.
Sorry for the error; I forgot to change it to f64 as 1.0 and 0.0.
I will change it accordingly in the next commit.

@JMencius
Contributor Author

JMencius commented Apr 24, 2024

I updated the code to check whether --mingc is not 0.0 or --maxgc is not 1.0, to see if the user set the GC content filter. If the GC content filter is set, the GC content is calculated; otherwise it is left at 0.5 so it passes the filter below.
I also updated the cal_gc function to use .len() rather than counting the length one base at a time.
This should not affect performance significantly if no GC content filter is set.
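The logic described above can be sketched roughly as follows. This is an illustrative Rust sketch, not chopper's actual code; the names `cal_gc`, `passes_gc_filter`, `mingc`, and `maxgc` are assumptions based on the discussion.

```rust
// Hypothetical sketch of the conditional GC filter described above.
// Function and parameter names are illustrative, not chopper's real identifiers.

/// GC fraction of a sequence, using the slice length directly
/// instead of counting bases one by one.
fn cal_gc(seq: &[u8]) -> f64 {
    let gc = seq
        .iter()
        .filter(|&&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count();
    gc as f64 / seq.len() as f64
}

/// Skip the GC computation entirely when the defaults (0.0, 1.0) are in use:
/// a dummy value of 0.5 always lies inside the default 0.0..=1.0 range.
fn passes_gc_filter(seq: &[u8], mingc: f64, maxgc: f64) -> bool {
    let gc = if mingc != 0.0 || maxgc != 1.0 {
        cal_gc(seq)
    } else {
        0.5 // filter not requested; never rejects the read
    };
    gc >= mingc && gc <= maxgc
}

fn main() {
    assert!(passes_gc_filter(b"ATGC", 0.0, 1.0)); // defaults: GC never computed
    assert!(passes_gc_filter(b"GGCC", 0.5, 1.0)); // GC = 1.0 passes --mingc 0.5
    assert!(!passes_gc_filter(b"ATAT", 0.5, 1.0)); // GC = 0.0 fails --mingc 0.5
    println!("all GC filter checks passed");
}
```

With the defaults in place, `cal_gc` is never called, which matches the benchmark result below that the unused filter adds no measurable overhead.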

@wdecoster
Owner

I have made some changes to how you implemented this. Would you agree that this is equivalent to your code?
Before I merge this, I want to ensure that the way this is implemented does not majorly impact performance (by benchmarking the current release against this).

@JMencius
Contributor Author

Yeah, I will run some benchmarks using huge FASTQ files.

@JMencius
Contributor Author

JMencius commented Apr 26, 2024

Hi @wdecoster, I ran some tests on a 44G FASTQ file from Drosophila melanogaster ONT sequencing and found some interesting results.

| Data | Size | Command | Chopper 0.8.0 run time | GC-filter Chopper run time |
| --- | --- | --- | --- | --- |
| DM.fastq | 44G | `cat DM.fastq \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq` | 309.832 s | 310.268 s |
| DM.fastq | 44G | `./chopper -q 10 -l 500 --input DM.fastq > DM_q10_l500.fastq` | 282.506 s | 281.578 s |
| DM.fastq | 44G | `./chopper -q 10 -l 500 --input DM.fastq --mingc 0.5 > DM_q10_l500_gchalf.fastq` | - | 382.220 s |
| DM.fastq.gz | 21G | `gunzip -c DM.fastq.gz \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq` | 652.202 s | 675.225 s |
| DM.fastq.gz | 21G | `./chopper -q 10 -l 500 --input DM.fastq.gz > DM_q10_l500.fastq` | 3077.761 s | 3060.424 s |
| DM.fastq.gz | 21G | `./chopper -q 10 -l 500 --input DM.fastq.gz --mingc 0.5 > DM_q10_l500_gchalf.fastq` | - | 2931.972 s |

I found that the GC filter does not significantly affect the run time. I also benchmarked --input: for uncompressed FASTQ, --input performs slightly better, but for the .fastq.gz file it is far slower. I think the GC filter part is good, but there is something to optimize in how compressed files are read directly.

@wdecoster
Owner

wdecoster commented Apr 26, 2024

Oh yes indeed, that is interesting. I agree that the GC calculation doesn't slow things down when it is not needed, so that is great. Out of curiosity, did you also benchmark it when you did specify a min or max gc?

It seems gunzip does a far better job than the decompression done in Rust. I wonder if it is because the decompression then runs in another process, piping its output, but I don't know how things are implemented in flate2. Things like this are also precisely the reason why chopper initially only read from stdin: Unix tools are intended to do one thing well, and just one thing, and piping tools into the next tool is a great approach. I will add a note regarding your findings to the README.

add note on speed of decompression while piping
@wdecoster wdecoster merged commit eeb5438 into wdecoster:master Apr 26, 2024
1 check passed
@JMencius
Contributor Author

Yeah, I added the test for the --mingc implementation above; I think the performance drop is acceptable given that only 4 cores were used.

> I wonder if it is because you are then running the decompression in another process, piping the output, but I don't know how things get implemented in flate2

Maybe I will tackle the --input performance problem for compressed files some day, to make use of multi-core decompression. I suppose one can run `pigz -d -c X.fastq.gz | chopper -q 10 -l 500 > X_q10_l500.fastq`, right?
